The term xG in football stands for ‘expected goals’. It is a statistical measurement of the quality of goalscoring chances and the likelihood of them being scored.
The goal of this task is to use shot and freeze-frame data to identify the best predictors of goals, and to build a new metric (xG) that estimates the probability of a shot being scored.
First, the two dataframes used for this task are loaded: shots.df (shot data) and frames.df (freeze-frame data giving the location and position of every player other than the shooter at the time of each shot).
shots.df <- read.csv("./shots_df.csv")
frames.df <- read.csv("./shots_freeze_frames_df.csv")
The data is then explored using summary statistics.
summary(shots.df)
## id period timestamp minute
## Length:2816 Min. :1.000 Length:2816 Min. : 0.00
## Class :character 1st Qu.:1.000 Class :character 1st Qu.:26.00
## Mode :character Median :2.000 Mode :character Median :49.00
## Mean :1.548 Mean :48.51
## 3rd Qu.:2.000 3rd Qu.:72.00
## Max. :2.000 Max. :96.00
##
## second possession duration location
## Min. : 0.00 Min. : 2.0 Min. :0.0018 Length:2816
## 1st Qu.:14.00 1st Qu.: 59.0 1st Qu.:0.4652 Class :character
## Median :29.00 Median :112.0 Median :0.9420 Mode :character
## Mean :29.31 Mean :109.8 Mean :0.9960
## 3rd Qu.:44.00 3rd Qu.:160.0 3rd Qu.:1.4098
## Max. :59.00 Max. :241.0 Max. :5.3096
##
## under_pressure type.id type.name possession_team.id
## Mode:logical Min. :16 Length:2816 Min. :746.0
## TRUE:427 1st Qu.:16 Class :character 1st Qu.:966.0
## NA's:2389 Median :16 Mode :character Median :969.0
## Mean :16 Mean :938.6
## 3rd Qu.:16 3rd Qu.:971.0
## Max. :16 Max. :974.0
##
## possession_team.name play_pattern.id play_pattern.name team.id
## Length:2816 Min. :1.000 Length:2816 Min. :746.0
## Class :character 1st Qu.:1.000 Class :character 1st Qu.:966.0
## Mode :character Median :2.000 Mode :character Median :969.0
## Mean :2.826 Mean :938.7
## 3rd Qu.:4.000 3rd Qu.:971.0
## Max. :9.000 Max. :974.0
##
## team.name player.id player.name position.id
## Length:2816 Min. : 4633 Length:2816 Min. : 2.0
## Class :character 1st Qu.:10188 Class :character 1st Qu.:12.0
## Mode :character Median :15579 Mode :character Median :17.0
## Mean :13483 Mean :15.9
## 3rd Qu.:16379 3rd Qu.:22.0
## Max. :24747 Max. :25.0
##
## position.name shot.statsbomb_xg shot.end_location shot.key_pass_id
## Length:2816 Min. :0.005823 Length:2816 Length:2816
## Class :character 1st Qu.:0.022912 Class :character Class :character
## Mode :character Median :0.048040 Mode :character Mode :character
## Mean :0.102732
## 3rd Qu.:0.115396
## Max. :0.887769
##
## shot.one_on_one shot.aerial_won shot.technique.id shot.technique.name
## Mode:logical Mode:logical Min. :89.00 Length:2816
## TRUE:151 TRUE:186 1st Qu.:93.00 Class :character
## NA's:2665 NA's:2630 Median :93.00 Mode :character
## Mean :92.96
## 3rd Qu.:93.00
## Max. :95.00
##
## shot.outcome.id shot.outcome.name shot.type.id shot.type.name
## Min. : 96.00 Length:2816 Min. :62.00 Length:2816
## 1st Qu.: 97.00 Class :character 1st Qu.:87.00 Class :character
## Median : 98.00 Mode :character Median :87.00 Mode :character
## Mean : 98.18 Mean :86.16
## 3rd Qu.:100.00 3rd Qu.:87.00
## Max. :116.00 Max. :88.00
##
## shot.body_part.id shot.body_part.name match_id competition_id
## Min. :37.00 Length:2816 Min. :19714 Min. :37
## 1st Qu.:38.00 Class :character 1st Qu.:19739 1st Qu.:37
## Median :40.00 Mode :character Median :19765 Median :37
## Mean :39.03 Mean :19766 Mean :37
## 3rd Qu.:40.00 3rd Qu.:19793 3rd Qu.:37
## Max. :70.00 Max. :19822 Max. :37
##
## season_id shot.open_goal shot.first_time shot.redirect shot.deflected
## Min. :4 Mode:logical Mode:logical Mode:logical Mode:logical
## 1st Qu.:4 TRUE:29 TRUE:446 TRUE:11 TRUE:25
## Median :4 NA's:2787 NA's:2370 NA's:2805 NA's:2791
## Mean :4
## 3rd Qu.:4
## Max. :4
##
## shot.saved_to_post location.x location.y shot.end_location.x
## Mode:logical Min. : 58.0 Min. : 5.00 Min. : 84
## TRUE:4 1st Qu.: 97.0 1st Qu.:34.00 1st Qu.:115
## NA's:2812 Median :105.7 Median :41.00 Median :119
## Mean :103.7 Mean :40.62 Mean :116
## 3rd Qu.:111.0 3rd Qu.:47.83 3rd Qu.:120
## Max. :120.0 Max. :78.60 Max. :120
##
## shot.end_location.y shot.end_location.z
## Min. : 2.00 Min. :0.000
## 1st Qu.:36.60 1st Qu.:0.600
## Median :40.10 Median :1.300
## Mean :40.29 Mean :1.726
## 3rd Qu.:43.80 3rd Qu.:2.400
## Max. :80.00 Max. :7.600
## NA's :853
summary(frames.df)
## id location.x location.y teammate
## Length:34476 Min. : 2.0 Min. : 0.00 Mode :logical
## Class :character 1st Qu.:100.0 1st Qu.:34.00 FALSE:22808
## Mode :character Median :106.0 Median :40.00 TRUE :11668
## Mean :104.9 Mean :40.46
## 3rd Qu.:113.0 3rd Qu.:47.00
## Max. :120.0 Max. :80.00
## player.id player.name position.id position.name
## Min. : 4633 Length:34476 Min. : 1.00 Length:34476
## 1st Qu.:15554 Class :character 1st Qu.: 3.00 Class :character
## Median :15709 Mode :character Median :10.00 Mode :character
## Mean :14919 Mean :10.48
## 3rd Qu.:17275 3rd Qu.:16.00
## Max. :24931 Max. :25.00
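NAs are replaced with FALSE, categorical variables are adjusted, and the target variable (is_goal) is added, created from the shot.outcome.name variable.
# Replace every NA with FALSE (in the logical flag columns, NA means the flag was not set)
shots.df[is.na(shots.df)] <- FALSE
# Target variable: 1 if the shot outcome is "Goal", 0 otherwise
shots.df$is_goal <- 0
shots.df$is_goal[shots.df$shot.outcome.name == "Goal"] <- 1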
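Because `shots.df[is.na(shots.df)] <- FALSE` applies to every column, NAs in numeric columns (such as shot.end_location.z) are coerced to 0 as well. A minimal sketch of restricting the replacement to logical columns, using a toy data frame whose values are made up for illustration:

```r
# Toy stand-in for shots.df; the values are illustrative only
toy <- data.frame(
  under_pressure      = c(TRUE, NA, NA),
  shot.first_time     = c(NA, TRUE, NA),
  shot.end_location.z = c(1.2, NA, 0.4)
)

# Find the logical columns and replace NA with FALSE only in those
lgl_cols <- vapply(toy, is.logical, logical(1))
toy[lgl_cols] <- lapply(toy[lgl_cols], function(col) {
  col[is.na(col)] <- FALSE
  col
})

toy
```

With this variant the missing shot height stays NA instead of being recorded as a shot ending at ground level.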
The 2 dataframes are joined, and the needed features for the analysis are created.
library(dplyr)  # rename(), mutate(), group_by(), summarize(), %>%
library(purrr)  # modify_if()

# Shot distance to the goal centre (120, 40) and the angle (in degrees) subtended by the goal.
# 7.32 is the goal width in metres.
cleaned.data <- shots.df %>%
  rename(x = location.x, y = location.y) %>%
  mutate(shot.distance = sqrt((120 - x)^2 + (40 - y)^2),
         shot.angle = atan(7.32 * (120 - x) / ((120 - x)^2 + (40 - y)^2 - (7.32 / 2)^2)) * 180 / pi) %>%
  modify_if(is.character, as.factor)

# Join the 2 DFs at shot id, and extract features about opponents' positions:
## (number of opponents in between the player and the goal, number of defenders, position of goalkeeper).
joined.df <- merge(cleaned.data, frames.df, by = "id") %>%
  rename(other.x = location.x, other.y = location.y, other.position = position.name.y) %>%
  mutate(other.distance = sqrt((120 - other.x)^2 + (40 - other.y)^2),
         # player is closer to the goal centre than the shooter
         other.in.space = other.distance < shot.distance,
         opp.in.space = other.in.space & !teammate,
         # opponents in that space whose position name contains "Defensive"
         defenders.in.space = opp.in.space & grepl("Defensive", other.position))

# Aggregate the freeze-frame rows back to one record per shot
eng.df <- joined.df %>%
  group_by(id, player.name.x, team.name,
           minute, second, possession, duration, under_pressure,
           play_pattern.name, shot.body_part.name,
           shot.technique.name, shot.type.name, match_id,
           shot.open_goal, shot.first_time, shot.redirect, shot.deflected, shot.saved_to_post,
           shot.distance, shot.angle, shot.statsbomb_xg, is_goal) %>%
  summarize(sum.opp.in.space = sum(opp.in.space),
            sum.defenders.in.space = sum(defenders.in.space),
            # distance from the goal centre of each player in the frame listed as "Goalkeeper"
            goal_keeper_distance = other.distance[other.position == "Goalkeeper"])
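To make the two geometric features concrete, here is a small sketch that wraps the same distance and angle formulas in helper functions and evaluates them for one hypothetical shot. The function names and the example coordinates are assumptions made for illustration; the constants are the ones used in the feature code.

```r
# Goal-centre coordinates and goal width, as used in the feature code
goal_x <- 120
goal_y <- 40
goal_width <- 7.32

# Euclidean distance from the shot location to the goal centre
shot_distance <- function(x, y) {
  sqrt((goal_x - x)^2 + (goal_y - y)^2)
}

# Angle (degrees) subtended by the goal mouth at the shot location
shot_angle <- function(x, y) {
  dx <- goal_x - x
  dy <- goal_y - y
  atan(goal_width * dx / (dx^2 + dy^2 - (goal_width / 2)^2)) * 180 / pi
}

# A hypothetical central shot, 12 units from the goal line
shot_distance(108, 40)
shot_angle(108, 40)
```

For a central shot at distance d greater than half the goal width, this angle equals 2 * atan((goal_width / 2) / d). Very close to goal the denominator can turn negative, in which case atan() returns a negative angle.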
After feature engineering and joining the dataframes, the data is explored again to discover relationships and correlations between the variables.
Figure: correlation matrix.
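The code behind the correlation matrix is not shown. A minimal sketch of how such a matrix can be computed with base R's cor(), using simulated data as a stand-in for eng.df (the column names mirror the engineered features, but all values are synthetic):

```r
set.seed(1)
n <- 200

# Simulated stand-in for eng.df; all values are synthetic
sim <- data.frame(
  shot.distance    = runif(n, 2, 40),
  sum.opp.in.space = rpois(n, 3)
)
sim$goal_keeper_distance <- runif(n, 0, 10)
sim$is_goal <- rbinom(n, 1, plogis(0.5 - 0.1 * sim$shot.distance))

# Pairwise Pearson correlations between the numeric columns
corr <- cor(sim)
round(corr, 2)
```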
Next, a model is created to predict expected goals. First, the data is split into training and test sets. Then, logistic regression is fitted to the training data to estimate the probability of each shot being scored. The features used to build this model are: shot.distance, shot.angle, sum.opp.in.space, sum.defenders.in.space, and goal_keeper_distance.
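Written out, a logistic regression on these five features has the following form. The symbols β₀ to β₅ are notation introduced here for the fitted coefficients; they do not appear in the code.

```latex
\eta = \beta_0
     + \beta_1\,\text{shot.distance}
     + \beta_2\,\text{shot.angle}
     + \beta_3\,\text{sum.opp.in.space}
     + \beta_4\,\text{sum.defenders.in.space}
     + \beta_5\,\text{goal\_keeper\_distance}

\text{xG} = P(\text{is\_goal} = 1 \mid \text{features}) = \frac{e^{\eta}}{1 + e^{\eta}}
```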
# split the data into train & test
data <- eng.df
set.seed(101)
# Select 80% of the rows as training data (train_idx avoids shadowing the sample() function)
train_idx <- sample.int(n = nrow(data), size = floor(.8 * nrow(data)), replace = FALSE)
train <- data[train_idx, ]
test <- data[-train_idx, ]
# Build the model: logistic regression
mod <- glm(is_goal ~ shot.distance + shot.angle + sum.opp.in.space + sum.defenders.in.space + goal_keeper_distance, data=train, family=binomial)
summary(mod)
##
## Call:
## glm(formula = is_goal ~ shot.distance + shot.angle + sum.opp.in.space +
## sum.defenders.in.space + goal_keeper_distance, family = binomial,
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4192 -0.5105 -0.3226 -0.1902 3.8031
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.271496 0.239924 -1.132 0.258
## shot.distance -0.062306 0.012379 -5.033 4.82e-07 ***
## shot.angle -0.001318 0.003136 -0.420 0.674
## sum.opp.in.space -0.252589 0.047312 -5.339 9.35e-08 ***
## sum.defenders.in.space -0.123378 0.192083 -0.642 0.521
## goal_keeper_distance 0.095040 0.023477 4.048 5.16e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1490.9 on 2218 degrees of freedom
## Residual deviance: 1276.1 on 2213 degrees of freedom
## AIC: 1288.1
##
## Number of Fisher Scoring iterations: 6
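The estimates in this summary are on the log-odds scale. One way to read them is to exponentiate each one, which gives the multiplicative change in the odds of scoring per one-unit increase in that feature. A short sketch using the estimates copied from the output above:

```r
# Estimates copied from the summary(mod) output
estimates <- c(
  shot.distance          = -0.062306,
  shot.angle             = -0.001318,
  sum.opp.in.space       = -0.252589,
  sum.defenders.in.space = -0.123378,
  goal_keeper_distance   =  0.095040
)

# exp(estimate) = factor by which the odds change per one-unit increase
odds_ratios <- exp(estimates)
round(odds_ratios, 3)
```

Under this model, each extra unit of shot distance multiplies the odds of scoring by roughly 0.94, and each additional opponent between the shooter and the goal by roughly 0.78, holding the other features fixed.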
# Extract the fitted coefficients of the model (intercept first, then one per feature)
coefs <- coef(mod)
int_coef <- coefs[1]
dist_coef <- coefs[2]
ang_coef <- coefs[3]
opp_coef <- coefs[4]
def_coef <- coefs[5]
gk_coef <- coefs[6]
# Assign an xG value to every shot: inverse logit of the linear predictor
log_odds <- int_coef + dist_coef * data$shot.distance + ang_coef * data$shot.angle +
  opp_coef * data$sum.opp.in.space + def_coef * data$sum.defenders.in.space +
  gk_coef * data$goal_keeper_distance
data$xG <- exp(log_odds) / (1 + exp(log_odds))
data
## # A tibble: 2,774 x 26
## # Groups: id, player.name.x, team.name, minute, second, possession, duration,
## # under_pressure, play_pattern.name, shot.body_part.name,
## # shot.technique.name, shot.type.name, match_id, shot.open_goal,
## # shot.first_time, shot.redirect, shot.deflected, shot.saved_to_post,
## # shot.distance, shot.angle, shot.statsbomb_xg, is_goal [2,769]
## id player.name.x team.name minute second possession duration
## <fct> <fct> <fct> <int> <int> <int> <dbl>
## 1 0024316a-8bbe~ Vivianne Mied~ Arsenal WFC 93 18 162 0.354
## 2 002e8652-613e~ Hannah Cain Everton LFC 88 50 217 2.29
## 3 003b0566-1e1d~ Nikita Parris Manchester C~ 3 23 10 0.351
## 4 0047b26b-5f8d~ Melissa Lawley Manchester C~ 35 28 82 0.678
## 5 005aa8fd-7fc8~ Kayleigh Green Brighton & H~ 2 18 9 1.85
## 6 007260ed-38c4~ Christie Murr~ Liverpool WFC 14 4 35 1.20
## 7 00772dae-2c12~ Abigail Harri~ Bristol City~ 49 50 92 0.747
## 8 00781c4f-9579~ Angharad James Everton LFC 56 17 142 1.85
## 9 00849624-27dd~ Nikita Parris Manchester C~ 14 38 28 1.38
## 10 0084c484-a5ed~ Brooke Hendrix West Ham Uni~ 56 43 120 0.874
## # ... with 2,764 more rows, and 19 more variables: under_pressure <lgl>,
## # play_pattern.name <fct>, shot.body_part.name <fct>,
## # shot.technique.name <fct>, shot.type.name <fct>, match_id <int>,
## # shot.open_goal <lgl>, shot.first_time <lgl>, shot.redirect <lgl>,
## # shot.deflected <lgl>, shot.saved_to_post <lgl>, shot.distance <dbl>,
## # shot.angle <dbl>, shot.statsbomb_xg <dbl>, is_goal <dbl>,
## # sum.opp.in.space <int>, sum.defenders.in.space <int>,
## # goal_keeper_distance <dbl>, xG <dbl>
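The xG values above are obtained by applying the inverse logit to the linear predictor by hand. As a cross-check, the same probabilities can be obtained from predict() with type = "response". A self-contained sketch on simulated data (the data and the two-feature model are assumptions for illustration, not the model fitted above):

```r
set.seed(42)
n <- 500

# Synthetic shots; not the data used in this analysis
sim <- data.frame(
  shot.distance    = runif(n, 2, 40),
  sum.opp.in.space = rpois(n, 3)
)
sim$is_goal <- rbinom(
  n, 1,
  plogis(0.5 - 0.08 * sim$shot.distance - 0.2 * sim$sum.opp.in.space)
)

fit <- glm(is_goal ~ shot.distance + sum.opp.in.space,
           data = sim, family = binomial)

# Manual route: inverse logit of the linear predictor
b <- coef(fit)
eta <- b[1] + b[2] * sim$shot.distance + b[3] * sim$sum.opp.in.space
xg_manual <- exp(eta) / (1 + exp(eta))

# Built-in route: predicted probabilities on the response scale
xg_predict <- predict(fit, newdata = sim, type = "response")

all.equal(unname(xg_manual), unname(xg_predict))
```

Using predict() avoids relying on the position of each coefficient in coef(mod), which silently breaks if the formula's term order changes.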
Next, some useful visualizations are created to plot the new metric (xG) against different variables and compare it to actual goals.
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
The next plot shows how shots are distributed across xG values, to see where the majority of shots lie.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The following comparison is drawn at a threshold of 0.3: a shot with xG >= 0.3 is counted as one expected goal.
# Compare actual goals and expected goals across different matches for "Manchester City WFC"
# at threshold = 0.3
library(reshape2)  # melt(); assumed here to come from reshape2
library(ggplot2)   # ggplot(), geom_smooth(), labs()
actual.to.xG <- data %>%
  filter(team.name == "Manchester City WFC") %>%
  group_by(match_id) %>%
  # count a shot as an expected goal when its xG reaches the threshold
  mutate(xG.as.goals = ifelse(xG >= 0.3, 1, 0)) %>%
  summarise(actual.goals = sum(is_goal), expected.goals = sum(xG.as.goals)) %>%
  # long format: one row per (match, variable) pair for plotting
  melt(id = c("match_id")) %>%
  ggplot(aes(match_id, value, color = variable)) +
  geom_smooth() +
  labs(title = "Actual Goals vs. Expected Goals for Manchester City WFC",
       x = "Match ID", y = "Goals Count")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
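The comparison above converts xG into a goal count by thresholding at 0.3. An alternative aggregation, shown here only as a sketch, is to sum the probabilities directly, which gives the expected number of goals under the model. The xG values below are hypothetical:

```r
# Hypothetical xG values for one team's shots in a single match
xg <- c(0.05, 0.12, 0.31, 0.08, 0.45, 0.02, 0.27)

# Thresholded count: shots with xG >= 0.3 each count as one expected goal
thresholded <- sum(xg >= 0.3)

# Sum of probabilities: expected number of goals under the model
summed <- sum(xg)

c(thresholded = thresholded, summed = summed)
```

The two aggregates can differ noticeably: thresholding ignores many low-probability shots that together contribute to the sum, and it rounds every shot above the cut-off up to a full goal.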